ABSTRACT

The standard error of measurement as a means for 
estimating the margin of error that should be allowed for in test 
scores is discussed. The true score measures the performance that is 
characteristic of the person tested; the variations, plus and minus, 
around the true score describe a characteristic of the test. When the 
standard deviation is used as a measure of the variation of observed 
scores around the true score, the result is called the standard error
of measurement. The standard error of measurement can be used in 
defining limits around the observed score within which one would be 
reasonably sure to find the true score. Since, in practice, it is not 
possible to give a large number of equivalent forms of a test in 
order to find the characteristic standard error of measurement, it is
determined by the reliability coefficient. As measured by the
reliability coefficient, reliability means consistency of 
measurement. It is unfortunately true that a test will have different 
reliability coefficients depending on the groups of people tested. 
The standard error of measurement is less subject to this variation.
The formula for computing it, which is given, takes into account both 
the reliability coefficient and the standard deviation for each 
group. A table is provided of Standard Errors of Measurement for
Given Values of Reliability Coefficient and Standard Deviation. (For
related document, see TM 002 943, S46.) (DB)






Test Service Bulletin



No. 50 



THE PSYCHOLOGICAL CORPORATION 
George K. Bennett, President



June, 1956 



Published from time to time in the interest of promoting greater understanding of the principles and techniques
of mental measurement and its applications in guidance, personnel work, and clinical psychology, and for
announcing new publications of interest. Address communications to 304 East 45th Street, New York, N. Y. 10017.



Harold G. Seashore, Editor 

Director of the Test Division 

Alexander G. Wesman 

Associate Director of the Test Division 



Jerome E. Doppelt 

Assistant Director 

James H. Ricks, Jr. 

Assistant Director 



Dorothy M. Clendenen 

Assistant Director 

Esther R. Hollis 

Advisory Service 






HOW ACCURATE IS A TEST SCORE? 

Every user of test scores knows that no test is perfectly accurate. The score on a test is determined principally by
the ability or knowledge of the person who takes it, but the score is also affected by the inaccuracy of the test
itself.

It would be helpful if we could know each time we see a score whether it is higher or lower than it should be, 
and by how much. Unfortunately, no one has ever figured out a practical way to determine the precise amount of 
error in an individual case. Statistics have been developed, however, for estimating the margin of error we should allow
for in test scores. One of the most useful of these is the standard error of measurement (SEm).

At this point, the reader may want to ask, "Doesn't the reliability coefficient tell us how accurate a test is?" 
The reliability coefficient does, of course, reflect the test's accuracy, but it has two drawbacks: (1) its numerical 
value depends, to a great extent, on the spread of scores in the group of people tested,* and (2) it does not help 
us directly in evaluating the scores earned by individual applicants and counselees. The SEm avoids these two
disadvantages. Later in this article, we will show how to compute the SEm and present a table for estimating it for
most tests.



Let us consider a practical situation in which it would 
be useful to have a measure of the accuracy of a test score. 
Suppose we have an opening for a junior executive in 
our company. We have a large number of applicants 
and among them is Henry Smith. He looks good on
most counts, but he has a score of 28 on a test of 
administrative knowledge. The test norms show that a 
score of 32 would place an applicant within the upper 
half of all executive applicants and we desire to make 
our choice from the upper half. Since Smith looks promising
in other ways we begin to wonder about his test
placement. 

If we could test him again, would he get 28 or some 
other score? Just what is Smith's true score on this test?
Before we can make sense in talking about the difference
between the true score and the observed or obtained
score, we need to specify what we mean by true score.



*For an illustration, see Wesman, Alexander G. Reliability
and Confidence. Test Service Bulletin, No. 44, May, 1952.



Imagine that we have a very large number of com- 
parable forms of our test. (We need not go into the 
statistics of comparable forms here; let us simply agree 
that comparable forms are interchangeable. That is, if
we had to choose only one form to measure administra- 
tive knowledge, we would be equally happy with any one 
of the forms.) Now suppose we were able to corner 
Henry Smith and test him with all our tremendous 
number of equivalent forms. We would find that our hero 
does not always get the same score. As the number of 
forms administered gets larger and larger, we would 
discover that the distribution of Smith's scores begins 
to resemble the familiar "normal" curve. In this situation,
we can reasonably decide that the average of the
large number of scores is characteristic of Smith's performance
on our test, and we will call this his true score.

At the beginning of the article we pointed out that 
the score on a test reflects primarily what the person 
tested brings to the task, but partly error of measurement
in the test. The true score measures the performance
that is characteristic of the person tested; the variations,
plus and minus, around the true score describe
a characteristic of the test.
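
The thought experiment above can be imitated with a small simulation. This is only a sketch under stated assumptions: the true score of 30 and the error spread of 3 points are hypothetical values chosen for illustration, and errors are assumed to be normally distributed, as the article supposes.

```python
import random
import statistics


def simulate_form_scores(true_score, sem, n_forms, seed=0):
    """Draw hypothetical scores on n_forms comparable forms.

    Each score is the true score plus a normally distributed error
    whose standard deviation is the SEm (an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.gauss(true_score, sem) for _ in range(n_forms)]


# Hypothetical person: true score 30, test SEm of 3 points.
scores = simulate_form_scores(true_score=30.0, sem=3.0, n_forms=10_000)

# The average of many scores approximates the true score (a trait of the person);
# the spread of the scores approximates the SEm (a trait of the test).
estimated_true_score = statistics.fmean(scores)
estimated_sem = statistics.pstdev(scores)

print(round(estimated_true_score, 1), round(estimated_sem, 1))
```

With a large number of forms, the two printed figures should come out close to the values fed into the simulation.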

When we use the standard deviation as a measure of
the variation of observed scores around the true score,
the result is called the standard error of measurement.
Since this statistic has direct interpretable meaning in
relation to the "normal" curve, we are in a position to
make this statement:

If we could know both an individual's exact true
score and the SEm which is characteristic of the
test, we would know that about 68% of the scores
the individual obtained on the vast number of comparable
forms fall within one SEm of his true score.
A band stretching two standard errors above and
below his true score would include about 95% of
his obtained scores, and within three standard
errors of the true score would lie over 99% of his
scores on the many forms of the test.
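
The 68%, 95%, and 99% figures in this statement are properties of the normal curve. As a quick check, the fraction of a normal distribution lying within k standard deviations of its mean is erf(k / sqrt(2)); the function name below is our own.

```python
import math


def coverage_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Print the percentage of obtained scores expected within 1, 2, and 3 SEm
# of the true score, assuming normally distributed errors.
for k in (1, 2, 3):
    print(k, round(coverage_within(k) * 100, 1))
```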

Obviously it is useful to be able to say, putting it a
little differently, that for about two thirds of all people 
tested, the observed scores lie within one SEm of the 
true scores — and that for nineteen out of twenty cases 
the observed score will not be more than two standard
errors away from the true score. 

As explained in the Note at the end of this article, we 
must be quite careful how we make statements like the
foregoing. It is not correct to say of an individual with a
certain observed score that the odds are two out of
three that his true score is within one SEm of the score
he got. But in the practical instance, we can use the 
SEm in defining limits around the observed score within 
which we would be reasonably sure to find the true score. 
Whether the "reasonable limits" (as Professor Gulliksen 
has called them) will be one, two, or three times the SEm
will depend on the level of confidence the test user desires.
The surer he wants to be of not making a mistake
in locating the true score, the broader the margin
of error he must allow for and therefore the less definite 
and precise will be the indication given by the test. The 
broader the score band we allow for each job applicant, 
for example, the greater the likelihood that his true score 
will be within it, but the harder it will be to tell the 
applicants apart. 

Coming back to the case of Henry Smith, let us suppose
that the test manual reveals that the SEm is 3 points.
If we establish "reasonable limits" of one SEm on either
side of the observed score, the band for Smith would 
extend over the score range 25-31. And since a score 
of 32 is needed before a person may be considered as 
belonging to the top half of executive trainees, we may 
decide that Smith does not belong in the top half of the 
group. We are not willing to act as if his true score is 32.
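
The arithmetic for Smith's band can be written out as a short sketch. The function name and the parameter k (the number of standard errors allowed on each side) are our own labels, not terms from any test manual.

```python
def score_band(observed, sem, k=1):
    """Return the (low, high) "reasonable limits" around an observed score."""
    return observed - k * sem, observed + k * sem


# Henry Smith: observed score 28, SEm of 3 points, limits of one SEm.
low, high = score_band(observed=28, sem=3, k=1)

# The cutoff of 32 marks the upper half of executive applicants.
reaches_cutoff = low <= 32 <= high

print(low, high, reaches_cutoff)
```

The band runs from 25 to 31 and stops short of the cutoff of 32, which is why we decline to treat Smith as belonging to the top half.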



We could have established wider "reasonable limits,"
say 2 or 3 SEm on either side of the observed score. We
would then have greater confidence that our location of
the true score within the band is correct. This extra confidence
costs us something. We pay for it by having more
people to be considered as possibilities. When there are 
many applicants, we usually want to reduce the number 
of eligible candidates even though we increase the pos- 
sibility of making a wrong decision about the true score 
of some of them. 
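
The cost of wider limits can be made concrete with a small sketch. The list of applicant scores below is invented for illustration; only Smith's score of 28, the SEm of 3, and the cutoff of 32 come from the example above.

```python
def eligible(scores, sem, cutoff, k):
    """Applicants whose band (observed score plus or minus k * sem) reaches the cutoff."""
    return [score for score in scores if score + k * sem >= cutoff]


# Hypothetical observed scores for eight applicants (28 is Henry Smith).
applicant_scores = [22, 25, 27, 28, 30, 31, 33, 35]

# Count how many applicants remain under consideration as the limits widen.
counts = {
    k: len(eligible(applicant_scores, sem=3, cutoff=32, k=k))
    for k in (1, 2, 3)
}

print(counts)
```

In this made-up group the number of people still in the running grows as the limits widen from one to three standard errors, and Smith himself becomes eligible once the limits reach two standard errors.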

Since in practice we cannot give a large number of 
equivalent forms of a test in order to find the characteristic
standard error of measurement, how do we determine
it? The answer to this takes us back to the
reliability coefficient. 

As measured by the reliability coefficient, reliability
means consistency of measurement. If the individuals of 
a group remain in about the same relative positions or 
ranks after successive testings, the test is "reliable" for 
that group. It is unfortunately true that a test will have 
different reliability coefficients depending on the groups
of people tested: higher coefficients for groups with a
wide spread of scores and lower ones for groups with 
scores bunched more closely together. 
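
One way to see this dependence is to rearrange the SEm formula given later in this article into r = 1 - (SEm / SD)^2. The sketch below assumes, purely for illustration, a test whose SEm stays at 3 points in every group; the three standard deviations are hypothetical.

```python
def implied_reliability(sd, sem):
    """Reliability implied by r = 1 - (sem / sd) ** 2.

    This is a rearrangement of SEm = SD * sqrt(1 - r) and is used here
    only as an illustration.
    """
    if sd <= 0 or sem < 0 or sem > sd:
        raise ValueError("need 0 <= sem <= sd and sd > 0")
    return 1.0 - (sem / sd) ** 2


# Same test (SEm held at 3 points), three hypothetical groups with
# narrow, moderate, and wide spreads of scores.
for sd in (4, 6, 12):
    print(sd, round(implied_reliability(sd, sem=3), 2))
```

The wider the spread of scores in the group, the higher the coefficient, even though the accuracy of the individual scores has not changed.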

The SEm is less subject to this variation; the formula 
for computing it takes into account both the reliability
coefficient and the standard deviation for each group. The 
formula is simple: 

$$SE_M = SD\sqrt{1 - r_{11}}$$

where SD is the standard deviation of the obtained scores
of a group and $r_{11}$ is the reliability coefficient computed
for the same group.*
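
As a minimal sketch, the formula translates directly into a few lines of code. The function name is our own, and the input values are hypothetical; they are not taken from any test manual.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Estimate the SEm as SD * sqrt(1 - reliability).

    sd and reliability must come from the same group of people.
    """
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical group: standard deviation of 10 points, reliability of .91.
sem_example = standard_error_of_measurement(sd=10, reliability=0.91)
print(round(sem_example, 1))

# Two hypothetical tests with equal reliability but different score units:
# the SEm scales with the standard deviation.
sem_wide = standard_error_of_measurement(sd=16, reliability=0.90)
sem_narrow = standard_error_of_measurement(sd=8, reliability=0.90)
print(round(sem_wide / sem_narrow, 1))
```

The second comparison shows why the size of the SEm alone does not tell us which of two tests is the more reliable: it is expressed in the score units of each test.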

Like a true score for an individual, the SEm for a test 
should be just one definite number if it is really a characteristic
of the test rather than of the people tested.
But if we look in a test manual, we may see that there 
appear to be differences among standard errors of 
measurement computed for different groups. For ex- 
ample, the SEm is reported for each of nine groups on 
the Numerical Test in the Personnel Tests for Industry
series. The values range from 1.7 to 2.4. The explanation
is that we have no way of computing the exact value 
of the SEm — the formula merely provides an estimate 
of the SEm. Estimates, of course, can be expected to 
differ. In any situation where we cannot obtain the true
value of a statistic, it is advisable to have as many estimates
of that value as practical. In the case of PTI-Numerical,
we can be comfortable with the conclusion
that the SEm is about 2 points.

*We cannot automatically say that the more accurate or reliable
of two tests is the one which has the lower value for its
SEm. As may be seen from the computing formula, the SEm is
tied in with the score units in which the standard deviation is
expressed. A test with a standard deviation of 16 points may
have the same reliability as a test with a standard deviation of
8 points. However, the SEm of the first test will be numerically
twice that of the second.

Standard Errors of Measurement for Given
Values of Reliability Coefficient and
Standard Deviation

              Reliability Coefficient
  SD     .95    .90    .85    .80    .75    .70

  30     6.7    9.5   11.6   13.4   15.0   16.4
  28     6.3    8.9   10.8   12.5   14.0   15.3
  26     5.8    8.2   10.1   11.6   13.0   14.2
  24     5.4    7.6    9.3   10.7   12.0   13.1
  22     4.9    7.0    8.5    9.8   11.0   12.0
  20     4.5    6.3    7.7    8.9   10.0   11.0
  18     4.0    5.7    7.0    8.0    9.0    9.9
  16     3.6    5.1    6.2    7.2    8.0    8.8
  14     3.1    4.4    5.4    6.3    7.0    7.7
  12     2.7    3.8    4.6    5.4    6.0    6.6
  10     2.2    3.2    3.9    4.5    5.0    5.5
   8     1.8    2.5    3.1    3.6    4.0    4.4
   6     1.3    1.9    2.3    2.7    3.0    3.3
   4      .9    1.3    1.5    1.8    2.0    2.2
   2      .4     .6     .8     .9    1.0    1.1

This table is based on the formula $SE_M = SD\sqrt{1 - r_{11}}$. For
most purposes the result will be sufficiently accurate if the table
is entered with the reliability and standard deviation values
nearest those given in the test manual. Be sure the standard
deviation and the reliability coefficient are for the same group
of people.
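
The table of standard errors can be regenerated from the formula, and the advice to enter it with the nearest tabled values can be sketched as a lookup. The names below are our own, and the manual figures in the final example (a standard deviation of 21.2 and a reliability of .88) are hypothetical.

```python
import math

SDS = list(range(30, 0, -2))                          # rows: 30, 28, ..., 2
RELIABILITIES = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70]  # columns


def sem(sd, reliability):
    """SEm estimated as SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def nearest(value, options):
    """Tabled value closest to the one reported (first match wins on a tie)."""
    return min(options, key=lambda option: abs(option - value))


def table_lookup(sd, reliability):
    """Enter the table with the nearest SD and reliability, rounded to one decimal."""
    row_sd = nearest(sd, SDS)
    col_r = nearest(reliability, RELIABILITIES)
    return round(sem(row_sd, col_r), 1)


def print_table():
    """Print the full table of SEm values, one row per standard deviation."""
    print("SD  " + "".join(f"{r:>7.2f}" for r in RELIABILITIES))
    for sd in SDS:
        print(f"{sd:<4}" + "".join(f"{sem(sd, r):>7.1f}" for r in RELIABILITIES))


print_table()

# Hypothetical manual values: SD of 21.2 and reliability of .88 are read
# from the row for SD 22 and the column for .90.
print(table_lookup(sd=21.2, reliability=0.88))
```

Entering the table at the nearest row and column gives only an approximation; when more precision is wanted, the formula can be applied to the manual's own figures.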

Many test manuals give both reliability coefficients and
standard errors of measurement for the convenience of
the user. When the SEm is not given, it can be estimated
readily by use of the reliability coefficient, provided the
manual also states the standard deviation of the particular
group of people on which the reliability coefficient
is based. It is well worth the test user's time to make this
computation; the table above permits an approximation
to be made easily without any figuring.

If, as is too often the case, the manual does not present
the standard deviation of the group for which the reliability
coefficient is reported, it would be advisable for
the user to write a letter to the test author. -J. E. D.



NOTE: As textbooks usually point out, it is correct to make
a statement of probability (such as "68% of the scores" or
"two out of three times") only when the SEm is applied
to the true score. If a test has a standard error of 5.5, it is
not correct to say of a person who obtains a score of 48 that
the chances are two out of three that his true score is between
42.5 and 53.5. This person's true score is a definite
number, although we do not know what it is. The statement
that his true score lies between 42.5 and 53.5 is either true or
false. Intermediate probabilities like "two out of three" or
"one out of twenty" cannot properly be attached to it. The
"reasonable limits" idea simply helps us to avoid making a
mathematical statement of probability which would be technically
inaccurate. Precise statements of probability in relation
to confidence intervals are possible but lie outside the
scope of this article.

Readers who want to pursue this and other fine points 
regarding the standard error of measurement will find good 
treatments in, among others, the following texts:

H. Gulliksen. Theory of mental tests. New York: Wiley,
1950.

T. L. Kelley. Fundamentals of statistics. Cambridge: Harvard
University Press, 1947.

E. F. Lindquist. A first course in statistics. Boston:
Houghton Mifflin, 1942.



A Book of BASIC READINGS 
ON THE MMPI 

This book, edited by G. S. Welsh and W. G. Dahlstrom,
brings together in one place 66 of the most important
articles on the Minnesota Multiphasic Personality
Inventory that have appeared in its fifteen years of
steadily widening use. More than 600 additional articles 
are listed in the bibliography, plus nearly 200 supple- 
mentary references. 

The articles are grouped in ten sections: Theory, Construction,
Coding, New Scales, Profile Analysis, Diagnostic
Profiles, Psychiatric Problems, Medical Problems,
Therapy, and General Personality. xviii + 656 pages.



